The following is an analysis of the clusters built with the k-means algorithm. The database was created as follows:
Since k-means, like the majority of clustering algorithms, tends to be memory-intensive due to the calculation of similarity matrices, we had to preprocess the data to reduce the size of the database, i.e. reduce the column space. The preprocessing can be summarized as follows:
To choose the number of clusters we used the elbow plot and the silhouette plot. We ran these measurements in Spark, since local computations ran out of memory (we use the Spark results as a proxy).
Although the elbow plot is not as smooth as we would like, the decrease in the total within-cluster sum of squares shows a large drop at 7 clusters, while the silhouette plot suggests that 6 clusters also achieve a good score (further analyses could be made with other numbers of clusters).
It is important to stress that these measurements are subjective, since a priori there are no labels against which to assess the goodness of fit.
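As a rough illustration of the two diagnostics mentioned above, the sketch below computes the total within-cluster sum of squares (the quantity behind an elbow plot) and the mean silhouette width for a given labeling. It is a minimal pure-Python sketch on hypothetical toy points, not the Spark job used for this report; the function names are assumptions made for the example.

```python
from math import dist
from statistics import mean


def total_within_ss(points, labels):
    """Total within-cluster sum of squares for a given labeling."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Centroid = column-wise mean of the cluster members.
        centroid = [mean(col) for col in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Mean silhouette width. O(n^2), so only suitable for a small sample.

    Assumes at least two distinct cluster labels.
    """
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(p, q))
        if lab not in by_cluster:
            # Singleton cluster: silhouette is conventionally 0.
            scores.append(0.0)
            continue
        a = mean(by_cluster[lab])  # mean distance to own cluster
        b = min(mean(d) for k, d in by_cluster.items() if k != lab)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical toy data: two well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(total_within_ss(toy_points, [0, 0, 1, 1]))
print(mean_silhouette(toy_points, [0, 0, 1, 1]))
```

In practice one would evaluate both functions for each candidate number of clusters and compare the curves; on data of this size the silhouette would have to be computed on a sample.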
In this case, using the kmeans function from the stats package, we can also see that the largest reduction in variance seems to be achieved at around 6 or 7 clusters. So, despite the differences between the libraries, the data behaves in much the same way.
In this case, using these components, we see that the large majority of points lie close together; only a few drift away from that zone.
When we plot the standard deviation by column, we can see that many are centered around zero, which could be explained by two possible factors:
In fact, as different models such as lifetime value have shown, most of our users perform little or no activity.
Repeating the same exercise by cluster, we can see that the standard deviations are close to zero, which is desirable because it suggests that features inside each cluster are similar. We’ll see whether that is the case or whether it has more to do with what we saw in the general histogram.
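The per-cluster check described above can be sketched as follows. This is a minimal example on hypothetical rows and labels (the `inactive` and `active` names are made up for illustration), using the population standard deviation from Python's standard library.

```python
from collections import defaultdict
from statistics import pstdev


def std_by_cluster(rows, labels):
    """Return {cluster label: [population std of each column]}."""
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append(row)
    return {
        lab: [pstdev(col) for col in zip(*members)]
        for lab, members in grouped.items()
    }


# Hypothetical data: an almost all-zero group and a small active group.
rows = [(0, 0), (0, 0), (0, 0), (0, 1), (5, 100), (50, 900)]
labels = ["inactive"] * 4 + ["active"] * 2
for cluster, stds in std_by_cluster(rows, labels).items():
    print(cluster, stds)
```

Note that a group made up almost entirely of zeros will show near-zero standard deviations whether or not its members are otherwise alike, which is the caveat raised in the paragraph above.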
| cluster | n |
|---|---|
| 1 | 831199 |
| 2 | 338275 |
| 3 | 43741 |
| 4 | 1408 |
| 5 | 361953 |
| 6 | 126 |
| 7 | 160784 |
Cluster 1 has by far the largest number of members. We can also identify some clusters with very few members: cluster 6 has the lowest count, 126, followed by cluster 4 with 1,408.
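To put the counts in the table above in proportion, the short sketch below converts them into shares of the total. The counts are copied from the table; everything else is illustrative.

```python
# Cluster sizes copied from the table above.
sizes = {1: 831199, 2: 338275, 3: 43741, 4: 1408, 5: 361953, 6: 126, 7: 160784}

total = sum(sizes.values())
shares = {cluster: n / total for cluster, n in sizes.items()}

# Print clusters from largest to smallest share.
for cluster, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"cluster {cluster}: {share:.4%}")
```

Clusters 1 and 5 together account for roughly 69% of the users, which is worth keeping in mind when reading the means in the rest of the report.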
In general, taking into account the complete years since account creation, we can see that cluster 1 has a large proportion of long-standing users; this is also true for cluster 4. Despite its small size, cluster 6 also has a large share of long-standing users. Cluster 5 has the most recent users. It is not surprising that most clusters have a high percentage of users with 0 complete years, since about 50% of the users studied fall into this group.
Here, other_registered_app means we have no record of the registration method the customer used. This is especially true for long-standing users, which is why it is so high in clusters 1 and 6. A very large share of cluster 5 used the app as the creation method.
Despite the number of possible labels, most users are of type w or n.
Regarding risk level, clusters 1 and 2 are mostly low risk, which could be due to little or no movement; clusters 3 and 7 have a mix of low and medium. The majority in cluster 6 is medium. For cluster 5 (the newest members), a large proportion are not labeled yet.
As discussed, cluster 5 is the one with the least metadata. This is also true for cluster 1, most likely because of problems capturing data for older users. In most clusters the gender distribution is quite similar, except for clusters 3 and 6, where there are more men.
Here, other_preferred_currency_brl refers to ars. Since brl is recent, it is difficult to see any participation inside the clusters. As expected, mxn has a very large share in almost all clusters. Some clusters have a high proportion of None.
Here, other_all_time_tribe_dynamic_bitbank refers to alpha. This information confirms what we expected: cluster 1 has not made any transaction in the months we are evaluating and could therefore be labeled as the inactive users. Cluster 2 is also a young one (in terms of complete years), but a very large proportion used bitbank in the period of interest, so these could be the rookies. In clusters 3 and 4 we see users that seem to be active in most of the services we offer. Cluster 5 consists of people who did not make any move; given its ‘youth’, it could be a target for campaigns that introduce them to crypto. Cluster 6, given its seniority, could be labeled ‘crypto savvies’. Cluster 7 has a large proportion that uses bitbank, perhaps crypto enthusiasts.
Here, other_age_less_than_35 refers to 25 years or below. We can see that clusters 2, 3 and 7 have a large proportion of young users. Clusters 4 and 6 are middle-aged users.
Here, other_country_code_BR refers to AR. Cluster 5 is the most balanced between MX and AR. Clusters 3 and 4 have practically the same distribution.
There are no large differences in this field between clusters. This could change when Brazil comes into play and gains relevance.
Here, other_has_Android is None. Since a user can have more than one device, the proportions do not add up to 1. Clusters 3, 4 and 6 seem to be active on desktop; this could be where alpha users are mostly located. Since cluster 2 has a very large proportion using Android, this could mean they are in fact newbies. Despite the lack of transactions in cluster 5, we see some users that use their device (Android).
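The remark that device proportions need not add up to 1 can be illustrated with a small sketch. The users and device names below are hypothetical; the point is only that each user may contribute to several device flags at once.

```python
def device_shares(users):
    """Share of users that have each device; a user may have several."""
    devices = sorted(set().union(*users))
    return {d: sum(d in u for u in users) / len(users) for d in devices}


# Hypothetical users: each set lists the devices seen for one user.
example_users = [{"android"}, {"android", "desktop"}, {"ios", "desktop"}, set()]
example_shares = device_shares(example_users)
print(example_shares)
print(sum(example_shares.values()))  # can exceed 1 because flags overlap
```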
Setting aside clusters 1 and 5, the large majority use either one or two devices.
This is the histogram of failed deposits. As expected, clusters 1 and 5 are concentrated at zero, in this case due to lack of activity.
This is the distribution of weeks of inactivity on web by cluster. Cluster 4 seems to be the most active, owing to its usage of web. Clusters 3 and 6 also seem to be quite active.
It is interesting to see that mx levels are left-skewed, since the mean is less than the median. In general we find mx user levels above 1 in clusters 3, 4, 6 and 7. As expected, since there is no natural cluster for Argentinians or Brazilians, it is hard to interpret their respective levels: the large volume of Mexican users overshadows their mean and median.
Most likely, cluster 4 holds the active traders from Argentina and Brazil, since we can see transactions in ars and brl (the mean is skewed since most users are Mexican). Looking at the median, cluster 3 consists of active users who trade low quantities on average and are focused on well-known cryptos such as btc, eth and xrp. Beyond what we saw about cluster 4, it also trades higher amounts than cluster 3. Cluster 6 is the cluster with the highest amount traded; its users are agnostic, since they trade several cryptos. As expected, cluster 7 looks like enthusiasts, since we can see some transactions made in btc.
As we guessed, the average amount traded in ars for cluster 4 is high, and we can also see some transactions made in brl. We should remember, though, that this mean is skewed because most users are Mexican, which adds a lot of zeros to the count (see the median). For cluster 6, we confirm that it is made up of high rollers in most cryptocurrencies.
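The effect of the zeros on the mean and the median can be seen in a small sketch. The amounts below are hypothetical and only mimic the situation described above, where most users never trade in a given currency.

```python
from statistics import mean, median


def mean_and_median(amounts):
    """Return (mean, median) of a list of traded amounts."""
    return mean(amounts), median(amounts)


# Hypothetical cluster: 95 users with no trades in the currency, 5 with trades.
amounts = [0.0] * 95 + [1000.0, 5000.0, 20000.0, 50000.0, 250000.0]

print(mean_and_median(amounts))                        # all users
print(mean_and_median([a for a in amounts if a > 0]))  # only users who traded
```

With this toy data the median over all users is zero while the mean is not, which is the same pattern as the many rows with a positive mean and a zero median in the tables of this report.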
Because of the difference in scale, let’s look at this in tabular form to get a better sense of the other clusters:
| cluster | type | mean | median |
|---|---|---|---|
| 1 | ars | 0.2112993 | 0.00000 |
| 1 | bat | 0.4314945 | 0.00000 |
| 1 | bch | 0.4731486 | 0.00000 |
| 1 | brl | 0.0002398 | 0.00000 |
| 1 | btc | 14.2778544 | 0.00000 |
| 1 | dai | 0.2178348 | 0.00000 |
| 1 | eth | 4.1592974 | 0.00000 |
| 1 | gnt | 0.2189814 | 0.00000 |
| 1 | ltc | 0.7519551 | 0.00000 |
| 1 | mana | 0.8579943 | 0.00000 |
| 1 | mxn | 23.9647433 | 0.00000 |
| 1 | tusd | 0.5419265 | 0.00000 |
| 1 | usd | 1.4640902 | 0.00000 |
| 1 | xrp | 5.3412555 | 0.00000 |
| 2 | ars | 11.2215169 | 0.00000 |
| 2 | bat | 1.4734147 | 0.00000 |
| 2 | bch | 1.9121872 | 0.00000 |
| 2 | brl | 0.0000129 | 0.00000 |
| 2 | btc | 67.4288081 | 0.00000 |
| 2 | dai | 1.7650056 | 0.00000 |
| 2 | eth | 16.0679963 | 0.00000 |
| 2 | gnt | 0.6577463 | 0.00000 |
| 2 | ltc | 3.9241474 | 0.00000 |
| 2 | mana | 3.4699955 | 0.00000 |
| 2 | mxn | 92.2497682 | 0.00000 |
| 2 | tusd | 1.7868063 | 0.00000 |
| 2 | usd | 5.1548557 | 0.00000 |
| 2 | xrp | 17.8011829 | 0.00000 |
| 3 | ars | 294.9630687 | 0.00000 |
| 3 | bat | 124.5147034 | 0.00000 |
| 3 | bch | 128.0204531 | 0.00000 |
| 3 | brl | 0.3658233 | 0.00000 |
| 3 | btc | 2680.3316257 | 137.23587 |
| 3 | dai | 124.8990668 | 0.00000 |
| 3 | eth | 1051.9567921 | 11.34188 |
| 3 | gnt | 25.2477322 | 0.00000 |
| 3 | ltc | 251.1133266 | 0.00000 |
| 3 | mana | 341.8657934 | 0.00000 |
| 3 | mxn | 4465.2871545 | 288.84482 |
| 3 | tusd | 44.0943652 | 0.00000 |
| 3 | usd | 643.5785670 | 0.00000 |
| 3 | xrp | 1099.5573218 | 12.09361 |
| 4 | ars | 16305.2573876 | 0.00000 |
| 4 | bat | 259.4997934 | 0.00000 |
| 4 | bch | 466.8572836 | 0.00000 |
| 4 | brl | 161.5110896 | 0.00000 |
| 4 | btc | 55365.1426842 | 10615.90422 |
| 4 | dai | 3297.1945085 | 0.00000 |
| 4 | eth | 15608.5525292 | 12.13719 |
| 4 | gnt | 72.9533355 | 0.00000 |
| 4 | ltc | 2177.8425895 | 0.00000 |
| 4 | mana | 516.1614197 | 0.00000 |
| 4 | mxn | 58130.9471472 | 12725.98113 |
| 4 | tusd | 2657.1101698 | 0.00000 |
| 4 | usd | 7293.8458041 | 0.00000 |
| 4 | xrp | 8952.3220170 | 0.00000 |
| 5 | ars | 0.0001266 | 0.00000 |
| 5 | bat | 0.0000503 | 0.00000 |
| 5 | bch | 0.0000000 | 0.00000 |
| 5 | brl | 0.0000000 | 0.00000 |
| 5 | btc | 0.0292330 | 0.00000 |
| 5 | dai | 0.0000283 | 0.00000 |
| 5 | eth | 0.0005405 | 0.00000 |
| 5 | gnt | 0.0002804 | 0.00000 |
| 5 | ltc | 0.0005762 | 0.00000 |
| 5 | mana | 0.0000470 | 0.00000 |
| 5 | mxn | 0.0088186 | 0.00000 |
| 5 | tusd | 0.0206033 | 0.00000 |
| 5 | usd | 0.0003556 | 0.00000 |
| 5 | xrp | 0.0016481 | 0.00000 |
| 6 | ars | 3184.8241361 | 0.00000 |
| 6 | bat | 23470.7630938 | 7077.70019 |
| 6 | bch | 15677.8226845 | 2937.87594 |
| 6 | brl | 1.2185312 | 0.00000 |
| 6 | btc | 299748.8885921 | 11310.77696 |
| 6 | dai | 5964.4000534 | 0.00000 |
| 6 | eth | 86984.4845313 | 11696.61853 |
| 6 | gnt | 7144.4274848 | 0.00000 |
| 6 | ltc | 23494.3589623 | 4637.72789 |
| 6 | mana | 43775.2680036 | 10065.04051 |
| 6 | mxn | 504563.8464155 | 234329.93673 |
| 6 | tusd | 3215.8646285 | 0.00000 |
| 6 | usd | 58148.9289910 | 0.00000 |
| 6 | xrp | 122893.5036711 | 41656.45650 |
| 7 | ars | 142.2913348 | 0.00000 |
| 7 | bat | 16.1935363 | 0.00000 |
| 7 | bch | 23.8907525 | 0.00000 |
| 7 | brl | 0.0018471 | 0.00000 |
| 7 | btc | 1064.9173274 | 123.86606 |
| 7 | dai | 17.6353851 | 0.00000 |
| 7 | eth | 272.9582404 | 0.00000 |
| 7 | gnt | 3.6934914 | 0.00000 |
| 7 | ltc | 57.9613583 | 0.00000 |
| 7 | mana | 36.0090167 | 0.00000 |
| 7 | mxn | 1420.4379100 | 100.69928 |
| 7 | tusd | 24.0129940 | 0.00000 |
| 7 | usd | 41.0763741 | 0.00000 |
| 7 | xrp | 176.4030589 | 0.00000 |
Aside from clusters 3, 4 and 7, which we identified as active, it is interesting to see that cluster 4 makes deposits using crypto. And despite the large amount traded by cluster 6, it does not, on median, make deposits.
Looking at the amount, cluster 4 not only makes the most deposits, but the amounts also seem to be quite high.
Cluster 4 not only makes a large number of deposits; its users also tend to be those with the highest withdrawal count.
Cluster 4 not only has the highest withdrawal count; it also has, on average, the highest withdrawal amount. For clusters 3 and 7 we see some withdrawals.
Let’s see how the ratio between withdrawals and deposits compares across clusters:
In this graphic we would like to see a number below 1, which would mean that on average (median) the amount deposited is greater than the amount withdrawn. For instance, in cluster 1, even though the large majority is inactive, those that are active tend to withdraw more. What is especially interesting in cluster 7 is that, on median, it has a high withdrawal rate in crypto.
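One possible way to compute such a ratio is sketched below. It assumes the ratio is the median withdrawal amount divided by the median deposit amount within a cluster; the report does not state the exact formula used for the graphic, so both the definition and the function name are assumptions.

```python
from statistics import median


def withdrawal_deposit_ratio(withdrawals, deposits):
    """Median withdrawal divided by median deposit.

    Returns None when the median deposit is zero, since the ratio
    is undefined in that case (common for inactive clusters).
    """
    median_deposit = median(deposits)
    if median_deposit == 0:
        return None
    return median(withdrawals) / median_deposit


# Hypothetical amounts for one cluster.
print(withdrawal_deposit_ratio([50, 100, 150], [100, 200, 300]))  # below 1: deposits dominate
print(withdrawal_deposit_ratio([10], [0, 0, 5]))                  # undefined: median deposit is 0
```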
As we can see, cluster 6 has produced the most trading revenue, followed by cluster 4. Clusters 3 and 7 produced a low amount, which can be explained by the low amounts they usually move.
Because of the scale, let’s see it in tabular form:
| cluster | type | mean | median |
|---|---|---|---|
| 1 | ars | 0.0147727 | 0.0000000 |
| 1 | bat | 0.3496897 | 0.0000000 |
| 1 | bch | 0.7653875 | 0.0000000 |
| 1 | brl | 0.0000237 | 0.0000000 |
| 1 | btc | 51.2437029 | 0.0000000 |
| 1 | dai | 0.0883461 | 0.0000000 |
| 1 | eth | 13.7780568 | 0.0000000 |
| 1 | gnt | 0.1102653 | 0.0000000 |
| 1 | ltc | 1.5689335 | 0.0000000 |
| 1 | mana | 0.4572292 | 0.0000000 |
| 1 | mxn | 11.3148923 | 0.0000000 |
| 1 | tusd | 0.7035981 | 0.0000000 |
| 1 | usd | 0.6178169 | 0.0000000 |
| 1 | xrp | 7.0950943 | 0.0000000 |
| 2 | ars | 0.7993179 | 0.0000000 |
| 2 | bat | 1.0892615 | 0.0000000 |
| 2 | bch | 1.5063943 | 0.0000000 |
| 2 | brl | 0.0000040 | 0.0000000 |
| 2 | btc | 72.6493848 | 0.0000000 |
| 2 | dai | 0.4899950 | 0.0000000 |
| 2 | eth | 20.4183860 | 0.0000000 |
| 2 | gnt | 0.4017417 | 0.0000000 |
| 2 | ltc | 3.4793228 | 0.0000000 |
| 2 | mana | 1.7416882 | 0.0000000 |
| 2 | mxn | 19.4069318 | 0.0000000 |
| 2 | tusd | 1.4115735 | 0.0000000 |
| 2 | usd | 1.7571409 | 0.0000000 |
| 2 | xrp | 8.4867263 | 0.0000000 |
| 3 | ars | 8.2141999 | 0.0000000 |
| 3 | bat | 42.2540976 | 0.0000000 |
| 3 | bch | 51.7602941 | 0.0000000 |
| 3 | brl | 0.1336175 | 0.0000000 |
| 3 | btc | 843.0746194 | 48.5540408 |
| 3 | dai | 9.7840988 | 0.0000000 |
| 3 | eth | 372.1308637 | 5.9239765 |
| 3 | gnt | 13.8951462 | 0.0000000 |
| 3 | ltc | 94.0817356 | 0.0000000 |
| 3 | mana | 75.9000178 | 0.0000000 |
| 3 | mxn | 254.5379471 | 7.7363446 |
| 3 | tusd | 12.1163676 | 0.0000000 |
| 3 | usd | 50.6212599 | 0.0000000 |
| 3 | xrp | 143.7373807 | 0.9205039 |
| 4 | ars | 116.5282226 | 0.0000000 |
| 4 | bat | 122.6251584 | 0.0000000 |
| 4 | bch | 343.5955620 | 0.0000000 |
| 4 | brl | 8.4569486 | 0.0000000 |
| 4 | btc | 24945.7397394 | 1435.5719316 |
| 4 | dai | 281.2001948 | 0.0000000 |
| 4 | eth | 8294.5231888 | 1.6043672 |
| 4 | gnt | 47.5240024 | 0.0000000 |
| 4 | ltc | 937.6029419 | 0.0000000 |
| 4 | mana | 239.2104121 | 0.0000000 |
| 4 | mxn | 7447.4708939 | 298.6050789 |
| 4 | tusd | 939.2980623 | 0.0000000 |
| 4 | usd | 601.6008956 | 0.0000000 |
| 4 | xrp | 1651.7182750 | 0.0000003 |
| 5 | ars | 0.0000277 | 0.0000000 |
| 5 | bat | 0.0000783 | 0.0000000 |
| 5 | bch | 0.0000171 | 0.0000000 |
| 5 | brl | 0.0000000 | 0.0000000 |
| 5 | btc | 0.1198917 | 0.0000000 |
| 5 | dai | 0.0003718 | 0.0000000 |
| 5 | eth | 0.0036547 | 0.0000000 |
| 5 | gnt | 0.0000409 | 0.0000000 |
| 5 | ltc | 0.0033901 | 0.0000000 |
| 5 | mana | 0.0001550 | 0.0000000 |
| 5 | mxn | 0.0125045 | 0.0000000 |
| 5 | tusd | 0.0034436 | 0.0000000 |
| 5 | usd | 0.0002694 | 0.0000000 |
| 5 | xrp | 0.0017034 | 0.0000000 |
| 6 | ars | 4.7732684 | 0.0000000 |
| 6 | bat | 6371.7574460 | 1076.3272830 |
| 6 | bch | 5875.0406399 | 278.9172442 |
| 6 | brl | 0.1379106 | 0.0000000 |
| 6 | btc | 26401.0528662 | 1339.3147564 |
| 6 | dai | 67.3326356 | 0.0000000 |
| 6 | eth | 15681.7967475 | 1447.5996213 |
| 6 | gnt | 2671.2853893 | 0.0000000 |
| 6 | ltc | 8773.6324739 | 447.1486673 |
| 6 | mana | 6763.2390084 | 963.9592011 |
| 6 | mxn | 14924.8541065 | 3855.0531108 |
| 6 | tusd | 223.8310011 | 0.0000000 |
| 6 | usd | 5912.3996321 | 0.0000000 |
| 6 | xrp | 14913.1823897 | 1906.1758859 |
| 7 | ars | 12.7282006 | 0.0000000 |
| 7 | bat | 9.2082770 | 0.0000000 |
| 7 | bch | 14.3135216 | 0.0000000 |
| 7 | brl | 0.0008607 | 0.0000000 |
| 7 | btc | 603.4947665 | 32.5666438 |
| 7 | dai | 2.4499374 | 0.0000000 |
| 7 | eth | 177.8922316 | 0.0000000 |
| 7 | gnt | 2.4045644 | 0.0000000 |
| 7 | ltc | 32.1738003 | 0.0000000 |
| 7 | mana | 13.8020591 | 0.0000000 |
| 7 | mxn | 155.7882137 | 0.0865314 |
| 7 | tusd | 5.9136768 | 0.0000000 |
| 7 | usd | 8.0801266 | 0.0000000 |
| 7 | xrp | 47.0252066 | 0.0000000 |
In the most active clusters we see that users normally hold popular cryptos. As we spotted before, cluster 4 most likely has the biggest traders in ars and brl. And cluster 6 has the highest mean balance in practically all cryptocurrencies.
In this case there is no clear pattern between clusters in the average portfolio return over the period.
In this case, in the most active clusters there is, on average, activity in events related to the usage of our services (2fa, logins, etc.). Looking at the median, clusters 2 and 7 show the most events related to new users. The low values in this category for cluster 5 might suggest that these users just opened their account and did nothing else afterwards.
These variables are binary. We can see that most clusters, except for 1 and 5, tend to have some activity. Again, cluster 5 shows little to no interaction from users that are relatively new, although over 25% of them open the app. Some events are related to alpha, where we can see clusters 3, 4, 6 and 7 having some activity.
Aside from adwords, where we see relevant participation from the young clusters (5 and 2, with around 40% and 20% respectively), marketing campaigns do not seem to have any influence on the clusters; that is, those that received communication are distributed across all of them.
From this high-level overview of the characteristics of each cluster we can draw some conclusions:
It’s important to note that, although we gave labels to the clusters, some users may not reflect the exact nature of the cluster they are in. This could be because the large proportion of inactive users interferes with the signals, creating near-zero variables as we’ve seen in the histogram at the beginning. A further exercise could be to run the algorithm excluding the inactive cluster, in order to see how the grouping changes.